Classification

Introduction

Suppose that instead of regression, our goal is classification. In this case $y_i$ is no longer numeric: the training data $\left\{\left(x_{1}, y_{1}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}$ consist of predictors $x_i$ paired with qualitative class labels $y_i$.

To initially quantify the accuracy of our classifier $\hat{f}$, we regard the training error rate: $$ \frac{1}{n} \sum_{i=1}^{n} I\left(y_{i} \neq \hat{y}_{i}\right) $$ where $\hat{y}_{i}$ is the predicted class for the $i$th observation using our prediction from $\hat{f}$, and $I\left(y_{i} \neq \hat{y}_{i}\right)$ is the indicator function, taking the value one if $y_{i} \neq \hat{y}_{i}$ and zero if $y_{i}=\hat{y}_{i}$.
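As a minimal sketch with hypothetical labels, the training error rate is simply the fraction of mismatches between true and predicted classes:

```python
def training_error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the truth,
    i.e. (1/n) * sum of I(y_i != yhat_i)."""
    return sum(yi != yhat for yi, yhat in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels for five observations; one is misclassified.
y_true = ["blue", "blue", "orange", "orange", "blue"]
y_pred = ["blue", "orange", "orange", "orange", "blue"]
print(training_error_rate(y_true, y_pred))  # 1 of 5 misclassified -> 0.2
```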

Analogously to the regression setting: we are most interested in the error rates that result from applying our classifier to test observations that were not used in training.

The test error rate associated with a set of test observations of the form $\left(x_{0}, y_{0}\right)$ is $$ \operatorname{Ave}\left(I\left(y_{0} \neq \hat{y}_{0}\right)\right) $$

The Bayes Classifier

Possible to show: test error rate given above is minimized, on average, by a very simple classifier that assigns each observation to the most likely class given its predictor values.

Assign a test observation with predictor vector $x_0$ to the class $j$ for which \begin{equation}\label{eq:bayes} \operatorname{Pr}\left(Y=j|X= x_{0}\right) \end{equation} is largest.

This is a conditional probability, i.e. the probability that $Y = j$, conditional on the observed predictor vector $x_0$.

Simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid indicates the region in which a test observation will be assigned to the blue class.

fig_bayes_class.png

The Bayes classifier produces the lowest possible test error rate. Since the Bayes classifier always chooses the class for which (\ref{eq:bayes}) is largest, the error rate at $X = x_0$ is $1-\max_{j} \operatorname{Pr}(Y=j | X=x_{0})$. In general, the overall Bayes error rate is given by

$$1-E\left(\max _{j} \operatorname{Pr}(Y=j | X)\right)$$

where the expectation averages the probability over all possible values of $X$. For our simulated data, the Bayes error rate is 0.1304. It is greater than zero, because the classes overlap in the true population so $\max _{j} \operatorname{Pr}(Y=j|X=x_{0} )<1$ for some values of $x_0$.

K-nearest neighbor

In theory: we would always like to predict qualitative responses using the Bayes classifier. But: for real data, we do not know the conditional distribution of $Y$ given $X$ $\rightarrow$ computing the Bayes classifier is impossible.

Many approaches attempt to estimate the conditional distribution of $Y$ given $X$, and then classify a given observation to the class with highest estimated probability.

One such method: the K-nearest neighbors (KNN) classifier. Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_{0}$.

It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_{0}$ whose response values equal $j$: $$ \operatorname{Pr}\left(Y=j | X=x_{0}\right)=\frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I\left(y_{i}=j\right) $$

Finally, KNN applies Bayes rule and classifies the test observation $x_0$ to the class with the largest probability.
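The three KNN steps above (find $\mathcal{N}_0$, estimate the class fractions, take the largest) can be sketched as follows; the toy data points are purely illustrative:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x0, K):
    """Classify x0 by majority vote among its K nearest training points."""
    # Step 1: order training indices by Euclidean distance to x0.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x0))
    # The neighborhood N_0 holds the labels of the K closest points.
    neighborhood = [train_y[i] for i in order[:K]]
    # Steps 2-3: the class fraction within N_0 estimates Pr(Y = j | X = x0);
    # its argmax is simply the most common label in the neighborhood.
    return Counter(neighborhood).most_common(1)[0][0]

# Toy data: two well-separated classes ("blue" / "orange") in the plane.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]
print(knn_predict(train_X, train_y, (1, 1), K=3))    # blue
print(knn_predict(train_X, train_y, (5, 5.5), K=3))  # orange
```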

fig_K_neighbor_I.png

Classification

In many classification problems the response variable is qualitative, instead of quantitative.

Predicting qualitative responses is called classification.

Predicting a qualitative response for an observation: classifying that observation.

On the other hand:

We often predict the probability of each of the categories of a qualitative variable, and use these probabilities as the basis for classification.

Just as in the regression setting: we have a set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ that we can use to build a classifier.

Classifier should perform well not only on the training data, but on test observations that were not used to train the classifier.

To illustrate: the simulated Default data set:

How do we build a model to predict default ($Y$) for any given value of balance ($X_1$) and income ($X_2$)?

fig_default_I.png

fig_K_neighbor.png

Why not linear regression

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not?

Suppose: predict the medical condition of a patient on the basis of her symptoms.

There are three possible diagnoses: stroke, drug overdose, and epileptic seizure. Could encode these values as a quantitative response variable, $Y$:

\begin{equation} Y=\left\{\begin{array}{ll}{1} & {\text { if stroke }} \\ {2} & {\text { if drug overdose }} \\ {3} & {\text { if epileptic seizure. }}\end{array}\right. \end{equation}

Now, use least squares to fit a linear regression model to predict $Y$ on the basis of a set of predictors $X_1, \ldots, X_p$.

But: this coding implies an ordering on the outcomes, and that the gap between stroke and drug overdose is the same as the gap between drug overdose and epileptic seizure. A different coding would imply a different ordering and yield different predictions.

What about binary variables? Suppose there are only two possibilities for the patient's medical condition: stroke and drug overdose.

\begin{equation} Y=\left\{\begin{array}{ll}{0} & {\text { if stroke }} \\ {1} & {\text { if drug overdose }}\end{array}\right. \end{equation}

Logistic regression

Logistic regression models the probability that $Y$ belongs to a particular category.

For the Default data example logistic regression models the probability of default:

\begin{equation} \operatorname{Pr}(\text { default }=\text { Yes } | \text { balance })=p(\text{balance}) \in [0,1] \end{equation}

To make predictions, we might predict $default = Yes$ for any individual for whom $p(balance) > 0.5.$

In linear regression, we would model $p(X)$ directly:

\begin{equation} p(X)=\beta_{0}+\beta_{1} X \end{equation}

In logistic regression: use the logistic function: \begin{equation} p(X)=\frac{e^{\beta_{0}+\beta_{1} X}}{1+e^{\beta_{0}+\beta_{1} X}} \end{equation} Rearranging yields the odds, which take values between $0$ and $\infty$.

\begin{equation} \frac{p(X)}{1-p(X)}=e^{\beta_{0}+\beta_{1} X} \end{equation}

How to interpret odds: values close to $0$ indicate very low probabilities of default, and very large values indicate very high probabilities of default.

For example, on average 1 in 5 people with an odds of $1/4$ will default, since $p(X) = 0.2$ implies an odds of $\frac{0.2}{1-0.2} = 1/4$. Likewise, on average nine out of every ten people with an odds of 9 will default, since $p(X) = 0.9$ implies an odds of $\frac{0.9}{1-0.9} = 9$.

Taking the logarithm on both sides yields the log-odds or logit.

\begin{equation} \log \left(\frac{p(X)}{1-p(X)}\right)=\beta_{0}+\beta_{1} X \end{equation}
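The three equivalent views (logistic function, odds, log-odds) can be checked numerically; the coefficient values below are illustrative, not fitted:

```python
import math

def logistic_p(x, b0, b1):
    """Logistic function: p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

def odds(p):
    """Odds p / (1 - p), taking values in (0, infinity)."""
    return p / (1 - p)

# Round trip: the log-odds (logit) recovers the linear predictor b0 + b1*x.
b0, b1, x = -10.0, 0.005, 1500.0  # illustrative values
p = logistic_p(x, b0, b1)
assert abs(math.log(odds(p)) - (b0 + b1 * x)) < 1e-9

# The odds examples from the text: p = 0.2 -> odds 1/4, p = 0.9 -> odds 9.
print(odds(0.2))  # 0.25
print(odds(0.9))  # approximately 9.0
```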

However, because the relationship between $p(X)$ and $X$ is not a straight line, $\beta_1$ does not correspond to the change in $p(X)$ associated with a one-unit increase in $X$.

The amount that $p(X)$ changes due to a one-unit change in $X$ will depend on the current value of $X$.

But regardless of the value of $X$, if $\beta_1$ is positive then increasing $X$ will be associated with increasing $p(X)$, and if $\beta_1$ is negative then increasing $X$ will be associated with decreasing $p(X)$.

fig_log_reg_I.png

But: how to interpret the coefficients?

fig_logit_coefficients.png

Better: predictions

For example: a student with a credit card balance of \$1,500 and an income of \$40,000 has an estimated probability of default of

\begin{equation} \hat{p}(X)=\frac{e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 1}}{1+e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 1}}=0.058 \end{equation}

A non-student with the same balance and income has an estimated probability of default of

\begin{equation} \hat{p}(X)=\frac{e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 0}}{1+e^{-10.869+0.00574 \times 1,500+0.003 \times 40-0.6468 \times 0}}=0.105 \end{equation}
  1. Simulate this model with the probabilities as described above with the following values:
  2. Estimate $\beta_0, \beta_1, \beta_2$ via maximum likelihood and calculate the standard errors.
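The two worked probabilities above can be reproduced directly from the quoted coefficients (income measured in thousands of dollars, student coded 0/1); the MLE part of the exercise would additionally require a fitting routine, which is not sketched here:

```python
import math

def p_default(balance, income_k, student):
    """Estimated default probability using the fitted coefficients quoted
    in the text (income in thousands of dollars, student as 0/1)."""
    z = -10.869 + 0.00574 * balance + 0.003 * income_k - 0.6468 * student
    return math.exp(z) / (1 + math.exp(z))

print(round(p_default(1500, 40, student=1), 3))  # 0.058 (student)
print(round(p_default(1500, 40, student=0), 3))  # 0.105 (non-student)
```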

Linear Discriminant Analysis

Previously: model the conditional distribution of $Y$, given $X$.

In LDA: model the distribution of the predictors $X$ separately in each of the response classes (i.e. given $Y$ ), and use Bayes’ theorem to flip these around into estimates for $\operatorname{Pr}(Y=k | X=x)$.

Suppose that we have $K$ classes, with $K \geq 2$.

Then Bayes' theorem states that

\begin{equation}\label{eq:bayes_theo} p_{k}(X)=\operatorname{Pr}(Y=k | X=x)=\frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)} \end{equation}

and we call $p_{k}(X)$ the posterior probability.

In order to estimate this: find estimates for the prior probabilities $\pi_k$ and the class densities $f_k(x)$.

LDA for $p=1$

Suppose that $p=1$: assuming the normal density we can write

\begin{equation} f_{k}(x)=\frac{1}{\sqrt{2 \pi} \sigma_{k}} \exp \left(-\frac{1}{2 \sigma_{k}^{2}}\left(x-\mu_{k}\right)^{2}\right) \end{equation}

where $\mu_{k}$ and $\sigma_{k}^{2}$ are the mean and variance parameters for the $k$th class.

Further suppose that $\sigma_{1}^{2}=\ldots=\sigma_{K}^{2} \equiv \sigma^{2}$, i.e. a variance shared across classes. Then we can write equation \ref{eq:bayes_theo} as

\begin{equation}\label{eq:bayes_norm} p_{k}(x)=\frac{\pi_{k} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}\left(x-\mu_{k}\right)^{2}\right)}{\sum_{l=1}^{K} \pi_{l} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}\left(x-\mu_{l}\right)^{2}\right)} \end{equation}

The Bayes classifier involves assigning an observation $X = x$ to the class for which equation \eqref{eq:bayes_norm} is largest. Taking the log of \eqref{eq:bayes_norm} and rearranging the terms, we can show that this is equivalent to assigning the observation to the class for which

\begin{equation}\label{eq:lda_delta_x} \delta_{k}(x)=x \cdot \frac{\mu_{k}}{\sigma^{2}}-\frac{\mu_{k}^{2}}{2 \sigma^{2}}+\log \left(\pi_{k}\right) \end{equation}

is largest.

For instance, if $K = 2$ and $\pi_1 = \pi_2$, then the Bayes classifier assigns an observation to class 1 if $2x\left(\mu_1 - \mu_2\right) > \mu_1^2 - \mu_2^2$, and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where

\begin{equation*} x=\frac{\mu_{1}^{2}-\mu_{2}^{2}}{2\left(\mu_{1}-\mu_{2}\right)}=\frac{\mu_{1}+\mu_{2}}{2} \end{equation*}

fig_bayes_norm_I.png

Even if we were certain that $f_k$ is normal, we still need to estimate the parameters:

\begin{equation} \begin{aligned} \hat{\mu}_{k} &=\frac{1}{n_{k}} \sum_{i : y_{i}=k} x_{i} \\ \hat{\sigma}^{2} &=\frac{1}{n-K} \sum_{k=1}^{K} \sum_{i : y_{i}=k}\left(x_{i}-\hat{\mu}_{k}\right)^{2} \end{aligned} \end{equation}

where $n$ is the total number of training observations, and $n_k$ is the number of training observations in the $k$th class.

Without additional information about class membership, we estimate \begin{equation} \hat{\pi}_{k}=n_{k} / n \end{equation}

Finally,

\begin{equation} \hat{\delta}_{k}(x)=x \cdot \frac{\hat{\mu}_{k}}{\hat{\sigma}^{2}}-\frac{\hat{\mu}_{k}^{2}}{2 \hat{\sigma}^{2}}+\log \left(\hat{\pi}_{k}\right) \end{equation}

where the discriminant function $\hat{\delta}_k$ is a linear function of $x$ (hence the name linear discriminant analysis).
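The whole $p=1$ pipeline (plug-in estimates $\hat{\mu}_k$, $\hat{\sigma}^2$, $\hat{\pi}_k$, then argmax of $\hat{\delta}_k$) fits in a few lines; the toy data below are illustrative:

```python
import math

def lda_fit_1d(xs, ys):
    """Plug-in estimates mu_k, shared sigma^2 (pooled, divisor n - K) and
    pi_k = n_k / n for one-dimensional LDA."""
    classes = sorted(set(ys))
    n, K = len(xs), len(classes)
    n_k = {k: sum(1 for y in ys if y == k) for k in classes}
    mu = {k: sum(x for x, y in zip(xs, ys) if y == k) / n_k[k] for k in classes}
    sigma2 = sum((x - mu[y]) ** 2 for x, y in zip(xs, ys)) / (n - K)
    pi = {k: n_k[k] / n for k in classes}
    return mu, sigma2, pi

def lda_predict_1d(x, mu, sigma2, pi):
    """Assign x to the class with the largest discriminant delta_k(x)."""
    def delta(k):
        return x * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + math.log(pi[k])
    return max(mu, key=delta)

# Toy data: class 1 centered at -1.5, class 2 at +1.5, equal priors,
# so the decision boundary sits at (mu_1 + mu_2) / 2 = 0.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [1, 1, 1, 2, 2, 2]
mu, sigma2, pi = lda_fit_1d(xs, ys)
print(lda_predict_1d(-0.5, mu, sigma2, pi))  # 1
print(lda_predict_1d(0.5, mu, sigma2, pi))   # 2
```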

LDA for $p>1$

We assume that

$X=\left(X_{1}, X_{2}, \ldots, X_{p}\right)$ follows $X \sim N(\mu_k, \Sigma)$, with a \emph{class-specific} multivariate mean vector $\mu_k$ and a common covariance matrix $\Sigma$.

For $p > 1$ predictors: the LDA classifier assumes that the observations in the $k$th class are drawn from a multivariate Gaussian distribution $N\left(\mu_{k}, \mathbf{\Sigma}\right)$, where $\mu_k$ is a class-specific mean vector and $\mathbf{\Sigma}$ is a covariance matrix common to all classes.

The resulting Bayes classifier can be shown to have the following form (after plugging the density function for the $k$th class, $f_k(x)$, into \ref{eq:bayes_theo}): the Bayes classifier assigns an observation $X = x$ to the class for which

\begin{equation}\label{eq:lda_mult} \delta_{k}(x)=x^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}-\frac{1}{2} \mu_{k}^{T} \boldsymbol{\Sigma}^{-1} \mu_{k}+\log \pi_{k} \end{equation}

is largest.

The decision boundaries represent the set of values $x$ for which $\delta_k(x)=\delta_l(x)$ for $k \neq l$:

\begin{equation} x^{T} \Sigma^{-1} \mu_{k}-\frac{1}{2} \mu_{k}^{T} \Sigma^{-1} \mu_{k}=x^{T} \Sigma^{-1} \mu_{l}-\frac{1}{2} \mu_{l}^{T} \Sigma^{-1} \mu_{l} \end{equation}
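A minimal sketch of the multivariate discriminant \eqref{eq:lda_mult} with $p = 2$, using hand-rolled $2 \times 2$ linear algebra and illustrative means, priors, and covariance:

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(S, v):
    return [S[0][0] * v[0] + S[0][1] * v[1], S[1][0] * v[0] + S[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def delta(x, mu, Sinv, pi):
    """delta_k(x) = x^T Sigma^-1 mu_k - (1/2) mu_k^T Sigma^-1 mu_k + log pi_k."""
    Sm = matvec(Sinv, mu)
    return dot(x, Sm) - 0.5 * dot(mu, Sm) + math.log(pi)

Sigma = [[1.0, 0.0], [0.0, 1.0]]            # common covariance (identity here)
Sinv = inv2(Sigma)
mus = {"blue": [0.0, 0.0], "orange": [3.0, 3.0]}  # class-specific means
pis = {"blue": 0.5, "orange": 0.5}                # equal priors

x = [1.0, 1.0]
pred = max(mus, key=lambda k: delta(x, mus[k], Sinv, pis[k]))
print(pred)  # blue (x is closer to the blue mean)
```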

fig_lda_mult.png

In order to make use of this: estimate $\mu_k$, $\mathbf{\Sigma}$ and $\pi_k$ from the training data, just as in the one-dimensional case.

Example: Default Data. We can perform LDA on the Default data in order to predict whether or not an individual will default on the basis of credit card balance and student status. The LDA model fit to the 10,000 training samples results in a training error rate of 2.75\%. Is this low?

Sensitivity, Specificity and total Error

The error rate of the null classifier (which always predicts no default) is only a bit higher than the LDA training error rate! What is the problem?

Need measures that capture class-specific performance: sensitivity and specificity.

In this case the sensitivity is the percentage of true defaulters that are identified, a low 24.3\% here. The specificity is the percentage of non-defaulters that are correctly identified, here $\left(1 - 23/9{,}667\right) \times 100 = 99.8\%$.

fig_lda_conf.png
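Both rates follow directly from the confusion-matrix counts. The non-defaulter numbers (23 wrongly flagged out of 9,667) are from the text; the defaulter counts below are hypothetical values chosen to be consistent with the quoted 24.3\%:

```python
def sensitivity(tp, fn):
    """Fraction of actual positives (defaulters) correctly flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives (non-defaulters) correctly passed."""
    return tn / (tn + fp)

# Non-defaulter side from the text: 23 of 9,667 flagged, so TN = 9,667 - 23.
print(round(specificity(9_667 - 23, 23) * 100, 1))  # 99.8
# Hypothetical defaulter counts matching the quoted sensitivity:
print(round(sensitivity(81, 333 - 81) * 100, 1))    # 24.3
```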

As we have seen, LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers.

The Bayes classifier works by assigning an observation to the class for which the posterior probability is the largest. In the two class case: assign an observation to the \verb|default| class if

\begin{equation} \operatorname{Pr}(\text { default }=\text { Yes } | X=x)>0.5 \end{equation}

Thus: Bayes classifier (and by extension LDA), uses a threshold of 50\% for the posterior probability of default in order to assign an observation to the default class.

However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold. For instance, we might label any customer with a posterior probability of default above 20\% to the default class. \begin{equation} \operatorname{Pr}(\text { default }=\text { Yes } | X=x)>0.2 \end{equation}

fig_lda_conf_II.png

fig_lda_conf_III.png

The ROC curve simultaneously displays the two types of errors for all possible thresholds.

fig_lda_conf_VI.png
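The threshold sweep behind an ROC curve can be sketched as follows, with hypothetical posterior probabilities in place of real LDA output:

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the classification threshold sweeps over the
    observed scores from high to low."""
    P = sum(labels)              # number of actual positives
    N = len(labels) - P          # number of actual negatives
    pts = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return pts

# Hypothetical posterior probabilities of default; 1 = actual defaulter.
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(fpr, tpr)
```

Lowering the threshold moves along this curve: the true positive rate (sensitivity) rises, at the cost of a rising false positive rate.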

QDA and Comparison of Classifiers

Quadratic Discriminant Analysis

The quadratic discriminant analysis (QDA) classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction.

However, unlike LDA, QDA assumes that each class has its own covariance matrix, i.e. an observation from the $k$th class is of the form $X \sim N\left(\mu_{k}, \boldsymbol{\Sigma}_{k}\right)$, where $\boldsymbol{\Sigma}_{k}$ is a covariance matrix for the $k$th class. Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which

\begin{equation}\label{eq_qda_delta_x} \begin{aligned} \delta_{k}(x) &=-\frac{1}{2}\left(x-\mu_{k}\right)^{T} \boldsymbol{\Sigma}_{k}^{-1}\left(x-\mu_{k}\right)-\frac{1}{2} \log \left|\boldsymbol{\Sigma}_{k}\right|+\log \pi_{k} \\ &=-\frac{1}{2} x^{T} \boldsymbol{\Sigma}_{k}^{-1} x+x^{T} \boldsymbol{\Sigma}_{k}^{-1} \mu_{k}-\frac{1}{2} \mu_{k}^{T} \boldsymbol{\Sigma}_{k}^{-1} \mu_{k}-\frac{1}{2} \log \left|\boldsymbol{\Sigma}_{k}\right|+\log \pi_{k} \end{aligned} \end{equation}

is largest. Unlike in \eqref{eq:lda_mult} the quantity $x$ appears as a quadratic function in \eqref{eq_qda_delta_x}.

So the QDA classifier involves plugging in estimates for $\boldsymbol{\Sigma}_{k}$, $\mu_{k}$ and $\pi_k$.

Why would one prefer LDA to QDA, or vice-versa?

The answer lies in the bias-variance trade-off. With $p$ predictors, estimating a single covariance matrix entails estimating $p(p+1)/2$ parameters.

Estimating a separate covariance matrix for each class $\rightarrow$ $K \, p(p+1)/2$ parameters.

Assuming that the $K$ classes share a common covariance matrix, the LDA model becomes linear in $x$, which means there are $Kp$ linear coefficients to estimate.

Consequently: LDA is a much less flexible classifier than QDA $\rightarrow$ lower variance.

But: if LDA’s assumption that the K classes share a common covariance matrix is badly off, then LDA $\rightarrow$ high bias.
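The parameter counts behind this trade-off are simple arithmetic; $p = 50$ and $K = 3$ below are illustrative choices:

```python
def qda_cov_params(K, p):
    """QDA: one covariance matrix per class -> K * p(p+1)/2 parameters."""
    return K * p * (p + 1) // 2

def lda_linear_coeffs(K, p):
    """LDA with a shared covariance matrix is linear in x -> K*p coefficients."""
    return K * p

# Illustrative sizes: with p = 50 predictors and K = 3 classes, QDA must
# estimate far more covariance parameters than LDA has linear coefficients.
print(qda_cov_params(3, 50))     # 3825
print(lda_linear_coeffs(3, 50))  # 150
```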

fig_LDA_QDA.png

A Comparison of Classification Methods

Logistic regression and LDA: linear decision boundaries. The difference lies in how the coefficients are estimated and in the underlying assumptions.

LDA: observations in each class are drawn from a Gaussian distribution with a class-specific mean and a covariance matrix common to all classes.

KNN: a nonparametric approach with no assumptions about the shape of the decision boundary, i.e. KNN can outperform LDA and logistic regression when the decision boundary is highly non-linear. But: it provides no coefficient estimates.

QDA: a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Not as flexible as KNN, but it performs better with a limited number of training observations.